Problem 1


Introduction


Methods


Loading Data

# loading data
kang_seed <- read.table(here("data/sev208_kratseedbank_20120213.txt"), sep=',', header = TRUE) |> 
             mutate(mnd = as.factor(mnd)) |> 
             filter(!(species %in% c("soil", "plant", "dist", "litter", "gravel")))


Missing Data

gg_miss_var(kang_seed)


Skim

skim(kang_seed)
Data summary
Name kang_seed
Number of rows 960
Number of columns 5
_______________________
Column type frequency:
character 3
factor 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
dir 0 1 1 1 0 4 0
loc 0 1 1 1 0 4 0
species 0 1 4 6 0 8 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
mnd 0 1 FALSE 10 18: 128, 23: 128, 25: 128, 6: 96

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
seeds 0 1 11.1 48.21 0 0 0 3 656 ▇▁▁▁▁


Visualize Data

# visualize counts
ggplot(kang_seed, aes(x = seeds)) +
  geom_histogram(bins = 17)

#visualize using box plot
ggplot(data = kang_seed, aes(x = mnd, y = seeds)) +
  geom_boxplot() +
  labs(x = "Mound Number", y = "NUmber of Seeds", title = "Boxplot of Seeds Found at Different Mounds", 
       caption = "Figure 1: ") +
  theme_bw()

# set up data for bar graph
kang_seed_bar <- kang_seed |> 
                 group_by(mnd) |> 
                 summarise(across(seeds, sum)) |> 
                 ungroup()
# bar graph
ggplot() +
  geom_bar(data = kang_seed_bar, aes(x = mnd, y = seeds, fill = mnd), 
           stat = "identity", show.legend = F) +
  theme_bw() +
  labs(x = "Mound Number", y = "Number of Seeds", 
       title = "Total Seed Count per Mound", caption = ) 


Question: How does total seed number differ between kangaroo rat mound locations?


Choose an Initial Test

We will start by attempting to run an ANOVA test to determine if there is a difference in the total seed count between kangaroo rat mound locations.


Set Up Initial Hypothesis for ANOVA

\(H_0:\) There is no significant difference in seed counts between the kangaroo rat mound locations.
\(H_a:\) There is a significant difference in seed counts between at least two of the kangaroo mound locations.


Initial Assumption Checks

# make aov object
kang_aov <- aov(seeds ~ mnd, data = kang_seed)

# get resiudals
kang_res <- kang_aov$residuals

# check normality
qqPlot(kang_res)

## [1] 899 195

The data does not appear to be normal, which we expected because in the context reading they discussed how in their analysis the data was non-normal even after trying transformations. We decided to try a log transformation regardless because this would be the logical next step in this analysis.

# transform data
kang_seed_log <- kang_seed |> 
                 mutate(seeds = log(seeds + 1))

# make new aov object
kang_log_aov <- aov(seeds ~ mnd, data = kang_seed_log)

# residuals
kang_log_res <- kang_log_aov$residuals

#test normality
qqPlot(kang_log_res)

## [1] 899 195

The data is still non-normal as expected (we stated above that the context paper mentioned how the data is not normal after all transformations).


Kruskal-Wallis Test

Due to this we are going to run a Kruskal-Wallis test, for which we meet all assumptions (there are categorical predictor variables, there are independent samples, and each group has at least five observations) and works with count data. This will change our null and alternative hypothesis to be:

\(H_0:\) There is no difference in the median total seed number between the kangaroo rat mound locations.
\(H_a:\) There is a difference in at least two kangaroo rat mound locations median total seed number.

# show aov summary
kruskal.test(seeds ~ mnd, data = kang_seed)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  seeds by mnd
## Kruskal-Wallis chi-squared = 16.997, df = 9, p-value = 0.04876

After running the Kruskal-Wallis test, we reject the null hypothesis. (See results).


Results


Problem 2


Introduction


Methods


Loading Data

# loading data
shrub_raw <- read.csv(here("data/shrubstudy_seed.csv"))

shrub_seed <- shrub_raw |>
              dplyr::select(treatment, species, total_inf = total_nr_infl, 
                      num_seeds = nr_seeds, shrub_num, plant_num = plant_nr, tag_num) |>
              mutate(treatment = as.factor(treatment))


Missing Data

# visualize missing data
gg_miss_var(shrub_seed)


Skim

skim(shrub_seed)
Data summary
Name shrub_seed
Number of rows 287
Number of columns 7
_______________________
Column type frequency:
character 1
factor 1
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1 6 6 0 19 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
treatment 0 1 FALSE 2 con: 174, shr: 113

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_inf 0 1.00 5.15 10.45 1 1.00 1 5.00 117 ▇▁▁▁▁
num_seeds 105 0.63 14.55 28.62 0 1.25 5 13.75 285 ▇▁▁▁▁
shrub_num 0 1.00 29.53 16.04 5 13.00 30 44.00 54 ▆▁▇▂▅
plant_num 0 1.00 2.43 1.34 1 1.00 2 3.00 5 ▇▆▅▃▂
tag_num 0 1.00 143.48 63.71 22 104.00 159 181.00 300 ▃▂▇▂▁


Subset Dataset

# filter out missing data
shrub_seed_sub <- shrub_seed |> 
                  drop_na(num_seeds) 


Histogram of Counts

ggplot(shrub_seed_sub, aes(x = num_seeds)) +
  geom_histogram(bins = 17)


Histogram of Species

ggplot(shrub_seed_sub, aes(x = species)) +
  geom_bar(stat = 'count')


One thing to note from this plot is that TRIDAS appears to only have one sample. Because of this, in the future, the predictions are likely to be all over the place and have a very large confidence interval. In order to increase the interpretability of the following plots, we will remove TRIDAS.


# filter out missing data
shrub_seed_sub <- shrub_seed_sub |> 
                  subset(species != "TRIDAS")


Variable Relationships

shrub_seed_sub |> 
  dplyr::select(!num_seeds) |> 
  ggpairs()


Question: How does seed count vary with plot type (shrub or open), plant species, and total number of inflorescences? Is there a simpler model that explains seed count, and if so, what is it?


Build Models

# linear model, we know this is wrong
shrub_model1 <- lm(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub)

# generalized linear model with Poisson distribution
shrub_model2 <- glm(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub, family = "poisson")

# generalized linear model with negative binomial distribution
shrub_model3 <- glmmTMB(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub, family = "nbinom2")

# generalized linear model with Poisson distribution and random effect of treatment
shrub_model4 <- glmmTMB(num_seeds ~ treatment + species + total_inf + (1|shrub_num), 
                          data = shrub_seed_sub, family = "poisson")

# generalized linear model with negative binomial distribution and random effect of treatment
shrub_model5 <- glmmTMB(num_seeds ~ treatment + species + total_inf + (1|shrub_num), 
                      data = shrub_seed_sub, family = "nbinom2")

Because we are looking at count data, we know that the data is discrete and only has a lower bound. Knowing this, we built a couple different models using the Poisson and Negative Binomial distribution.


Model 1
# check diagnostics
simulationOutput_m1 <- simulateResiduals(shrub_model1)

plot(simulationOutput_m1)

par(mfrow = c(1,2))

testDispersion(simulationOutput_m1)
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.96669, p-value = 0.744
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m1)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = Inf, p-value < 2.2e-16
## alternative hypothesis: two.sided


Model 2
# check diagnostics
simulationOutput_m2 <- simulateResiduals(shrub_model2)

plot(simulationOutput_m2)

par(mfrow = c(1,2))

testDispersion(simulationOutput_m2)
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 21.719, p-value < 2.2e-16
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m2)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 3.992, p-value < 2.2e-16
## alternative hypothesis: two.sided


Model 3
simulationOutput_m3 <- simulateResiduals(shrub_model3)

plot(simulationOutput_m3)

par(mfrow = c(1,2))

testDispersion(simulationOutput_m3)
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.4263, p-value = 0.24
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m3)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.4202, p-value = 0.024
## alternative hypothesis: two.sided


Model 4
# check diagnostics
simulationOutput_m4 <- simulateResiduals(shrub_model4)

plot(simulationOutput_m4)

par(mfrow = c(1,2))

testDispersion(simulationOutput_m4)
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.83908, p-value = 0.464
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m4)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 3.1189, p-value < 2.2e-16
## alternative hypothesis: two.sided


Model 5
# check diagnostics
simulationOutput_m5 <- simulateResiduals(shrub_model5)

plot(simulationOutput_m5)

par(mfrow = c(1,2))

testDispersion(simulationOutput_m5)
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.5972, p-value = 0.24
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m5)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.4281, p-value = 0.048
## alternative hypothesis: two.sided


Breakdown of models 1-5:
Model Formula Distribution QQ Plot Residuals Residual vs. Predicted Over/Under Dispersion Zeros
shrub_model1 num_seeds ~ treatment + species + total_inf Normal F F Under Too many
shrub_model2 num_seeds ~ treatment + species + total_inf Poisson F F Over Too many
shrub_model3 num_seeds ~ treatment + species + total_inf Negative Binomial P F None Too many
shrub_model4 num_seeds ~ treatment + species + total_inf + (1|shrub_num) Poisson F F Over Too many
shrub_model5 num_seeds ~ treatment + species + total_inf + (1|shrub_num) Negative Binomial P F None Too many

Based on the tests above, it appears that either shrub_model3 or shrub_model5 will be the best models because neither are over or under-dispered. However, they do still have some patterns in the residuals, and they appear to be zero-inflated. To further our model, we will create a new model with a zero-inflated term.


Model 6
# generalized linear model with negative binomial distribution
shrub_model6 <- glmmTMB(num_seeds ~ treatment + species + total_inf, ziformula = ~1,
                        data = shrub_seed_sub, family = "nbinom2")

With the zero-inflated model created, we will now run the necessary diagnostics.

# check diagnostics
simulationOutput_m6 <- simulateResiduals(shrub_model6)

plot(simulationOutput_m6)

par(mfrow = c(1,2))

testDispersion(simulationOutput_m6)
## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 1.1974, p-value = 0.472
## alternative hypothesis: two.sided
testZeroInflation(simulationOutput_m6)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 1.0479, p-value = 0.76
## alternative hypothesis: two.sided

Based on the above plots, shrub_model6 appears to be our best model. There is no over or underdispersion, and our distribution has about the same number of zeros as our model would predict. There is still some trend in our residuals, but it appears to have been lessened when compared to shrub_model3. We acknowledge this trend in residuals and have decided that isn’t a big cause for concern since the other diagnostics were met.

Moving forward, we will check AICC values to test whether our assumptions about the models are correct.


Which distribution to use?

MuMIn::model.sel(shrub_model1, shrub_model2, shrub_model3, shrub_model4, shrub_model5, shrub_model6)
## Model selection table 
##              (Intr) spc ttl_inf trt cnd((Int)) dsp((Int)) cnd(spc) cnd(ttl_inf)
## shrub_model6                             2.156          +        +      0.06634
## shrub_model3                             1.917          +        +      0.07502
## shrub_model5                             1.915          +        +      0.07505
## shrub_model1 -2.567   + 2.15000   +                                            
## shrub_model4                             2.488          +        +      0.03145
## shrub_model2  2.549   + 0.02989   +                                            
##              cnd(trt) zi((Int)) family   class random df    logLik   AICc
## shrub_model6        +    -2.283 n2(lg) glmmTMB         9  -548.867 1116.8
## shrub_model3        +           n2(lg) glmmTMB         8  -552.052 1120.9
## shrub_model5        +           n2(lg) glmmTMB c(s_n)  9  -551.884 1122.8
## shrub_model1                    gs(id)      lm         8  -689.162 1395.2
## shrub_model4        +           ps(lg) glmmTMB c(s_n)  8 -1055.923 2128.7
## shrub_model2                    ps(lg)     glm         7 -1147.380 2309.4
##                delta weight
## shrub_model6    0.00  0.852
## shrub_model3    4.15  0.107
## shrub_model5    6.03  0.042
## shrub_model1  278.37  0.000
## shrub_model4 1011.90  0.000
## shrub_model2 1192.62  0.000
## Abbreviations:
##  family: gs(id) = 'gaussian(identity)', n2(lg) = 'nbinom2(log)', 
##          ps(lg) = 'poisson(log)'
## Models ranked by AICc(x) 
## Random terms: 
##  c(s_n): cond(1 | shrub_num)

This chart confirms our assumptions about our models; shrub_model6 is best, followed by shrub_model3 and shrub_model5. We know this because of their respective AICC values of 1127.6, 1132.0, and 1133.9. In general, a lower AICC value equates to a better model.

Next, we will look closer at our model: shrub_model6.


Model Summary

# summary
summary(shrub_model6)
##  Family: nbinom2  ( log )
## Formula:          num_seeds ~ treatment + species + total_inf
## Zero inflation:             ~1
## Data: shrub_seed_sub
## 
##      AIC      BIC   logLik deviance df.resid 
##   1115.7   1144.5   -548.9   1097.7      172 
## 
## 
## Dispersion parameter for nbinom2 family (): 2.18 
## 
## Conditional model:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.155988   0.212883  10.128  < 2e-16 ***
## treatmentshrub -0.299250   0.131830  -2.270   0.0232 *  
## speciesCARRUP  -1.718777   0.281984  -6.095 1.09e-09 ***
## speciesGEUROS  -0.238970   0.242542  -0.985   0.3245    
## speciesKOBMYO   0.046928   0.201021   0.233   0.8154    
## speciesMINOBT  -0.416711   0.212101  -1.965   0.0495 *  
## total_inf       0.066343   0.007516   8.827  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.2827     0.3991  -5.719 1.07e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary of our model shows that treatmentshrub, speciesCARRUP, speciesMINOBT, speciesTRIDAS, and total_inf all appears to be significant indicators of total number of seeds.

# confidence intervals
confint(shrub_model6)
##                          2.5 %       97.5 %    Estimate
## cond.(Intercept)     1.7387449  2.573230737  2.15598779
## cond.treatmentshrub -0.5576313 -0.040868234 -0.29924976
## cond.speciesCARRUP  -2.2714556 -1.166099126 -1.71877737
## cond.speciesGEUROS  -0.7143430  0.236403207 -0.23896992
## cond.speciesKOBMYO  -0.3470654  0.440921896  0.04692825
## cond.speciesMINOBT  -0.8324210 -0.001001714 -0.41671134
## cond.total_inf       0.0516118  0.081073969  0.06634289
## zi.(Intercept)      -3.0649895 -1.500469125 -2.28272929
# adjusted R2
r.squaredGLMM(shrub_model6)
##                 R2m       R2c
## delta     0.7225478 0.7225478
## lognormal 0.7643760 0.7643760
## trigamma  0.6655456 0.6655456


Results


Table Format

# model object in table
shrub_model6 |> 
  as_flextable() |> 
  set_caption(
    as_paragraph(
      as_chunk("Table 1: This table displays what we studied earlier in the summary, 
      just in an easier to digest format.
      
      
               ", 
               props = fp_text_default(font.family = "Arial"))
    ), 
    
    word_stylename = "Table Caption"
  )
Table 1: This table displays what we studied earlier in the summary,
just in an easier to digest format.


Estimate

Standard Error

statistic

p-value

Fixed effects

(Intercept)

2.156

0.213

10.128

0.0000

***

treatmentshrub

-0.299

0.132

-2.270

0.0232

*

speciesCARRUP

-1.719

0.282

-6.095

0.0000

***

speciesGEUROS

-0.239

0.243

-0.985

0.3245

speciesKOBMYO

0.047

0.201

0.233

0.8154

speciesMINOBT

-0.417

0.212

-1.965

0.0495

*

total_inf

0.066

0.008

8.827

0.0000

***

(Intercept)

-2.283

0.399

-5.719

0.0000

***

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

square root of the estimated residual variance: 2.2

data's log-likelihood under the model: -548.9

Akaike Information Criterion: 1,115.7

Bayesian Information Criterion: 1,144.5


Visualization


# Compute predictions
predictions <- ggpredict(shrub_model6, terms = c("treatment", "species", "total_inf"))
# Plot showing seed differences by Species in the ACTUAL DATA
ggplot(shrub_seed_sub, aes(x = treatment, y = num_seeds, fill = treatment)) +
  geom_boxplot(aes(color = treatment), alpha = 0.5) +
  facet_wrap(~ species) +
  scale_fill_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
  scale_color_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
  theme_bw() +
  labs(
    x = "",
    y = "Number of Seeds",
    title = "Measured Number of Seeds per Species",
    caption = "Figure X: This figure compares the measured number of seeds per species. The species AREFEN appears
                to have the highest number of seeds, whereas CARRUP appears to have the lowest number of 
                seeds. The AREFEN species has largest spread, likely because it was the species with the
                least amount of samples (apart from TRIDAS, which we removed)."
  ) + 
  coord_cartesian(ylim = c(0, 70)) +
  theme(plot.caption = element_text(hjust=0),
        legend.position = 'none')


# Predictions
pred <- ggpredict(shrub_model6, terms = "species", back.transform = TRUE)

# Plot showing seed differences by Species      
plot(pred) +
  labs(
    x = "",
    y = "Number of Seeds",
    title = "Comparing Estimated Number of Seeds per Species",
    caption = "Figure X: This figure compares the estimated number of seeds per species. The species AREFEN and
                KOBMYO appears to have the highest number of seeds, whereas CARRUP appears to have 
                the lowest number of seeds. The AREFEN species has largest spread, likely because it 
                was the species with the least amount of samples. These estimated results are very 
                similar to what we saw in the measured data." 
  ) +
  theme(plot.caption = element_text(hjust=0))


# Viz showing measured seed difference by plot type 
plottype1 <- ggplot(shrub_seed_sub, aes(x = treatment, y = num_seeds, fill = treatment)) +
  geom_boxplot(aes(color = treatment), alpha = 0.5) +
  scale_fill_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
  scale_color_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
  theme_bw() +
  labs(
    x = "Species",
    y = "Number of Seeds",
    title = "Comparing Measured Number of Seeds per Species"
  ) + 
  coord_cartesian(ylim = c(0, 40)) + 
  theme(legend.position = "none")

# Predictions
pred2 <- ggpredict(shrub_model6, terms = "treatment", back.transform = TRUE)

# Viz showing estimated seed difference by plot type     
plottype2 <- plot(pred2) +
  labs(
    x = "Plot Type",
    y = "Number of Seeds",
    title = "Comparing Estimated Number of Seeds per Plot Type"
  ) + 
  coord_cartesian(ylim = c(0, 40))

# making viz side by side
plottype_fig <- ggarrange(plottype1 + rremove("xlab") + rremove('ylab'), 
          plottype2 + rremove("xlab") + rremove('ylab'), 
          ncol = 2,
          labels = c('A.', 'B.', 'C.', 'D.', 'E.'), # adding figure labels in top left of plot
          vjust = 5, # adjusting figure label position
          hjust = -4)

annotate_figure(plottype_fig,
                left = textGrob("Number of Seeds", # add universal y-axis
                        rot = 90, vjust = 0.5, # styling
                        gp = gpar(cex = 1.3)),
                bottom = textGrob("",
                        gp = gpar(cex = 1.3))) +
  labs(
    caption = "
    
    Figure X: This figure shows the difference in the number of seeds per plot type, for both the
                measured data, as well as the data estimated from our model. The model manages 
                to predict which plot results in higher seed count, but it fails to capture the
                variability." 
  ) +
  theme(plot.caption = element_text(hjust=0))


inf1 <- ggplot(shrub_seed_sub, aes(x = total_inf, y = num_seeds)) +
  geom_point() +
  theme_bw() +
  labs(
    x = "Number of inflorescences",
    y = "Number of Seeds",
    title = "Comparing Measured Number of Seeds to Number of inflorescences"
  ) + 
  coord_cartesian(xlim = c(0, 60),
                  ylim = c(0, 500))

# Predictions
pred3 <- ggpredict(shrub_model6, terms = "total_inf", back.transform = TRUE)

# Plot showing seed differences by Species      
inf2 <- plot(pred3) +
  labs(
    x = "Number of inflorescences",
    y = "Number of Seeds",
    title = "Comparing Estimated Number of Seeds to Number of inflorescences"
  ) + 
  coord_cartesian(xlim = c(0, 60),
                  ylim = c(0, 500))

# making viz side by side
inf_fig <- ggarrange(inf1 + rremove("xlab") + rremove('ylab'), 
          inf2 + rremove("xlab") + rremove('ylab'), 
          ncol = 2,
          labels = c('A.', 'B.', 'C.', 'D.', 'E.'), # adding figure labels in top left of plot
          vjust = 5, # adjusting figure label position
          hjust = -4)

annotate_figure(inf_fig,
                left = textGrob("Number of Seeds", # add universal y-axis
                        rot = 90, vjust = 0.5, # styling
                        gp = gpar(cex = 1.3)),
                bottom = textGrob("Number of inflorescences",
                        gp = gpar(cex = 1.3))) +
  labs(
    caption = "
    
    Figure X: This figure shows how the number of seeds varies as the number of inflorescences
                increases. The model appears to do a good job at predicting when you compare it
                to the measured data." 
  ) +
  theme(plot.caption = element_text(hjust=0))


To answer the question we are studying, how does seed count vary with plot type (shrub or open), plant species, and total number of inflorescences, we need to look at the effects of each of the variables.

The first pair of plots show that different species result in a wide variety in the seed count. In general, our model tends to underestimate the number of seeds for all of the species. However, it does correctly predict which species result in a larger number of seeds or a smaller number of seeds, which we believe to be the more valuable insight.

The second pair of plots show that the number of seeds is generally higher in the open (control) plot. It does this well, correctly predicting that the open plot typically has a higher number of seeds compared to the shrub plot.

Lastly, the third pair of plots show that the number of seeds increases as the number of total inflorescences increases. Because of the lack of samples with high inflorescences, the confidence interval grows significantly on the latter part of the plot. However, where we have a lot of data points, the model predicts the number of seeds very well.


Now that we know how seed count varies with plot type, species, and total number of inflorescences, we will investigate to see if we can predict seed count with a simpler model.


Investigating a Simpler Model